
Allow PermutedDimsArray in gemm_strided_batched #539

Merged: 17 commits into JuliaGPU:master from mcabbott:batchedmul191, Nov 17, 2020

Conversation

mcabbott
Contributor

This is copied from JuliaGPU/CuArrays.jl#664, which needed FluxML/NNlib.jl#191, now merged (but not yet tagged).

Summarising:

CUBLAS's gemm_strided_batched! accepts strides other than the ordered, column-major ones of a CuArray. This is potentially useful because it saves a permutedims before the operation, and the obvious way to expose it is to let the wrapper accept PermutedDimsArrays.
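For concreteness, a rough sketch of the intended use (the sizes and the (transA, transB, alpha, A, B, beta, C) argument order follow the existing wrapper; treat the exact call as illustrative):

```julia
using CUDA

A  = CUDA.rand(Float32, 5, 7, 16)      # batch of 16 matrices, each 5×7
Bp = CUDA.rand(Float32, 7, 16, 3)      # parent array with the batch dimension in the middle
B  = PermutedDimsArray(Bp, (1, 3, 2))  # lazily viewed as 7×3×16: same memory, permuted strides
C  = CUDA.zeros(Float32, 5, 3, 16)

# With this PR the wrapper accepts B directly, passing its strides to CUBLAS
# instead of requiring a permutedims copy first.
CUDA.CUBLAS.gemm_strided_batched!('N', 'N', 1f0, A, B, 0f0, C)
```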

I presume that by the time anyone is calling CuArrays.CUBLAS.gemm_strided_batched!, they are well aware that the operation is to be done by CUBLAS, and not expecting dispatch to redirect to the CPU if it's an Array. So I changed this to accept any AbstractArray.

Then NNlib.batched_mul will be a friendly function which dispatches every kind of array to the best routine. I hope I got this sorted out! To perform e.g. matrix * 3-tensor contractions like the table here, or batched matrix-vector products, the matrix (or vector) should be reshaped to another 3-array, with a size=1 dimension to indicate that it carries no batch index.
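A minimal sketch of that reshape trick, assuming the size-1 batch dimension broadcasts against the other argument's batch as described above (CPU arrays shown; CuArray inputs dispatch to the CUBLAS routine):

```julia
using NNlib

M = rand(Float32, 5, 7)        # an ordinary matrix, no batch index
X = rand(Float32, 7, 3, 16)    # a 3-tensor whose last dimension is the batch

# Reshape M to a 3-array with a trailing size-1 dimension, so batched_mul
# applies it to every slice of X:
Y = batched_mul(reshape(M, 5, 7, 1), X)  # size 5×3×16
```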

@mcabbott mcabbott changed the title allow PermutedDimsArray in gemm_strided_batched Allow PermutedDimsArray in gemm_strided_batched Nov 11, 2020
@maleadt
Member

maleadt commented Nov 12, 2020

I presume that by the time anyone is calling CuArrays.CUBLAS.gemm_strided_batched!, they are well aware that the operation is to be done by CUBLAS, and not expecting dispatch to redirect to the CPU if it's an Array. So I changed this to accept any AbstractArray.

It serves as documentation too. Looking at the name of the function, one might expect to be able to use it with actual strided inputs (i.e. a non-contiguous view), which the DenseCuArray type specification prevents.

Please rebase on latest master to get CI.

@mcabbott
Contributor Author

It's a pity to lose the documentation aspect, but is there another way? Inserting some huge union isn't going to result in a readable methods(). Perhaps I could insert one more step to _gemm_strided_batched! with a wider signature, which the NNlib function calls?
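A hypothetical illustration of the indirection being floated here (the _gemm_strided_batched! helper, its placement in lib/cublas/wrappers.jl, and the exact signatures are all part of the suggestion, not code from the PR):

```julia
# Public method keeps the narrow, self-documenting signature:
function gemm_strided_batched!(transA::Char, transB::Char, alpha::Number,
                               A::DenseCuArray{T,3}, B::DenseCuArray{T,3},
                               beta::Number, C::DenseCuArray{T,3}) where {T}
    _gemm_strided_batched!(transA, transB, alpha, A, B, beta, C)
end

# Internal method with the wide signature, which NNlib.batched_mul could call
# directly for wrappers such as PermutedDimsArray that CUBLAS can still consume:
function _gemm_strided_batched!(transA::Char, transB::Char, alpha::Number,
                                A::AbstractArray{T,3}, B::AbstractArray{T,3},
                                beta::Number, C::AbstractArray{T,3}) where {T}
    # the existing ccall-based implementation would move here unchanged
end
```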

Can rebase but I think CI won't work until NNlib is tagged.

@maleadt
Member

maleadt commented Nov 12, 2020

Can rebase but I think CI won't work until NNlib is tagged.

Oh, right. You can always add an explicit dep to the Manifest if you want to run CI before that happens.
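For example (a sketch of the usual way to record an unreleased dependency in the Manifest; the branch name is just NNlib's default):

```julia
using Pkg
# Equivalent to `pkg> add NNlib#master` in the REPL; records the git revision
# in Manifest.toml so CI installs the untagged version.
Pkg.add(PackageSpec(name="NNlib", rev="master"))
```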

Perhaps I could insert one more step to _gemm_strided_batched! with a wider signature, which the NNlib function calls?

Eh, maybe just add a comment at the end of the line.
Does this actually test the PermuteDims case? Otherwise it might be easy to accidentally refactor this part of the changes away.
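For concreteness, a minimal sketch of the kind of test being asked about (illustrative only, not the tests that were actually added; it assumes NNlib.batched_mul routes CuArray-backed inputs to gemm_strided_batched!):

```julia
using CUDA, NNlib, Test

A  = CUDA.rand(Float32, 5, 6, 8)
Bp = CUDA.rand(Float32, 6, 8, 4)
B  = PermutedDimsArray(Bp, (1, 3, 2))   # 6×4×8 view with permuted strides, no copy

# The GPU result should match the same computation done on the CPU.
@test Array(batched_mul(A, B)) ≈ batched_mul(Array(A), Array(B))
```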

@mcabbott
Contributor Author

mcabbott commented Nov 14, 2020

Good point about tests, I've added a few.

And put NNlib#master in the manifest to test on CI. (Documentation still fails, I presume because it's on a different manifest.)

This needs something to make pointers, possibly just Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) = .... Where should this live? I mean should Adapt or some other package somehow take care of this, instead of writing it directly?
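A minimal sketch of what that definition might look like (the PR's final version may differ; the point is that a PermutedDimsArray shares its parent's memory, so the pointer is just the parent's):

```julia
# The view starts at the same address as its parent; only the strides differ.
Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) where {T} =
    Base.unsafe_convert(CuPtr{T}, parent(A))
```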

@maleadt
Member

maleadt commented Nov 16, 2020

This needs something to make pointers, possibly just Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) = .... Where should this live? I mean should Adapt or some other package somehow take care of this, instead of writing it directly?

These conversions live in array.jl (CUDA.jl/src/array.jl, lines 370 to 490 at 1139ffd):

## views

# optimize view to return a CuArray when contiguous

struct Contiguous end
struct NonContiguous end

# NOTE: this covers more cases than the I<:... in Base.FastContiguousSubArray
CuIndexStyle() = Contiguous()
CuIndexStyle(I...) = NonContiguous()
CuIndexStyle(::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::Colon, ::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::AbstractUnitRange, ::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::Colon, I...) = CuIndexStyle(I...)

cuviewlength() = ()
@inline cuviewlength(::Real, I...) = cuviewlength(I...) # skip scalar
@inline cuviewlength(i1::AbstractUnitRange, I...) = (Base.unsafe_length(i1), cuviewlength(I...)...)
@inline cuviewlength(i1::AbstractUnitRange, ::Base.ScalarIndex...) = (Base.unsafe_length(i1),)

@inline function Base.view(A::CuArray, I::Vararg{Any,N}) where {N}
    J = to_indices(A, I)
    @boundscheck begin
        # Base's boundscheck accesses the indices, so make sure they reside on the CPU.
        # this is expensive, but it's a bounds check after all.
        J_cpu = map(j->adapt(Array, j), J)
        checkbounds(A, J_cpu...)
    end
    J_gpu = map(j->adapt(CuArray, j), J)
    unsafe_view(A, J_gpu, CuIndexStyle(I...))
end

@inline function unsafe_view(A, I, ::Contiguous)
    unsafe_contiguous_view(Base._maybe_reshape_parent(A, Base.index_ndims(I...)), I, cuviewlength(I...))
end

@inline function unsafe_contiguous_view(a::CuArray{T}, I::NTuple{N,Base.ViewIndex}, dims::NTuple{M,Integer}) where {T,N,M}
    offset = Base.compute_offset1(a, 1, I) * sizeof(T)

    alias(a.baseptr)
    b = CuArray{T,M}(a.baseptr, dims, a.ctx; offset=a.offset+offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end

@inline function unsafe_view(A, I, ::NonContiguous)
    Base.unsafe_view(Base._maybe_reshape_parent(A, Base.index_ndims(I...)), I...)
end

# pointer conversions

## contiguous
function Base.unsafe_convert(::Type{CuPtr{T}}, V::SubArray{T,N,P,<:Tuple{Vararg{Base.RangeIndex}}}) where {T,N,P}
    return Base.unsafe_convert(CuPtr{T}, parent(V)) +
           Base._memory_offset(V.parent, map(first, V.indices)...)
end

## reshaped
function Base.unsafe_convert(::Type{CuPtr{T}}, V::SubArray{T,N,P,<:Tuple{Vararg{Union{Base.RangeIndex,Base.ReshapedUnitRange}}}}) where {T,N,P}
    return Base.unsafe_convert(CuPtr{T}, parent(V)) +
           (Base.first_index(V)-1)*sizeof(T)
end


## reshape

# optimize reshape to return a CuArray
function Base.reshape(a::CuArray{T,M}, dims::NTuple{N,Int}) where {T,N,M}
    if prod(dims) != length(a)
        throw(DimensionMismatch("new dimensions $(dims) must be consistent with array size $(length(a))"))
    end

    if N == M && dims == size(a)
        return a
    end

    alias(a.baseptr)
    b = CuArray{T,N}(a.baseptr, dims, a.ctx; offset=a.offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end

# allow missing dimensions with Colon()
if VERSION < v"1.6.0-DEV.1358"
    Base.reshape(parent::CuArray, dims::Tuple{Vararg{Union{Int,Colon}}}) =
        Base.reshape(parent, Base._reshape_uncolon(parent, dims))
end


## reinterpret

# optimize reinterpret to return a CuArray
function Base.reinterpret(::Type{T}, a::CuArray{S,N}) where {T,S,N}
    if N == 0 && sizeof(T) != sizeof(S)
        throw(ArgumentError("cannot reinterpret a zero-dimensional `$(S)` array to `$(T)` which is of a different size"))
    end
    if N != 0 && sizeof(S) != sizeof(T)
        ax1 = axes(a)[1]
        dim = length(ax1)
        if Base.rem(dim*sizeof(S), sizeof(T)) != 0
            throw(ArgumentError("""
                cannot reinterpret an `$(S)` array to `$(T)` whose first dimension has size `$(dim)`.
                The resulting array would have non-integral first dimension.
                """))
        end
        if first(ax1) != 1
            throw(ArgumentError("cannot reinterpret a `$(S)` array to `$(T)` when the first axis is $ax1. Try reshaping first."))
        end
    end

    isize = size(a)
    size1 = div(isize[1]*sizeof(S), sizeof(T))
    osize = tuple(size1, Base.tail(isize)...)

    alias(a.baseptr)
    b = CuArray{T,N}(a.baseptr, osize, a.ctx; offset=a.offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end
We used to have them for ReshapedArray and ReinterpretArray too, but recently we've switched back to implementing those with CuArray again.

@mcabbott
Contributor Author

Great, I've moved the pointer conversion as you describe.

NNlib is now tagged, and I adjusted the bound in Project.toml. But CI doesn't seem to see v0.7.7, perhaps it needs more time.

@maleadt
Member

maleadt commented Nov 17, 2020

Sometimes the PkgServer lags General a bit. I restarted the build, and there's a normal failure in the NNlib tests now.

@codecov

codecov bot commented Nov 17, 2020

Codecov Report

Merging #539 (0be7a0e) into master (cd1191f) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #539   +/-   ##
=======================================
  Coverage   77.82%   77.83%           
=======================================
  Files         116      116           
  Lines        6423     6425    +2     
=======================================
+ Hits         4999     5001    +2     
  Misses       1424     1424           
Impacted Files Coverage Δ
lib/cublas/wrappers.jl 90.98% <100.00%> (+0.01%) ⬆️
src/array.jl 88.42% <100.00%> (+0.06%) ⬆️
src/nnlib.jl 14.28% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1191f...0be7a0e. Read the comment docs.

@maleadt maleadt added the labels cuda array (Stuff about CuArray.) and enhancement (New feature or request) on Nov 17, 2020
@maleadt
Member

maleadt commented Nov 17, 2020

All green, let's merge this!

@maleadt maleadt merged commit d16a4b3 into JuliaGPU:master Nov 17, 2020
@mcabbott mcabbott deleted the batchedmul191 branch November 17, 2020 12:09
@mcabbott
Contributor Author

Thanks! And sorry about the excessive CI-ing, trying to finish this while away from my desktop maybe wasn't the brightest plan.

@maleadt
Member

maleadt commented Nov 17, 2020

No problem, that's what it's there for. But it prompted me to add a smoke test that has to succeed before starting 10 other jobs 😄
